In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
Alright, now we're going to be working with the running data set, so let's go ahead and import it. We'll see again that it's not converting the dates, so we've got to do that manually. However, it doesn't work exactly like it did last time.
In [2]:
df = pd.read_csv('../data/date_fixed_running_data.csv')
In [3]:
df.head()
Out[3]:
This is because when you save a DataFrame to a CSV, pandas doesn't label the index column, so our date column comes back as 'Unnamed: 0' in position zero. This happened because the index didn't have a name when we saved the last CSV. With parse_dates you can refer to a column either by its name (if it has one) or by its 0-based position.
In [4]:
df['Unnamed: 0'].head()
Out[4]:
In [5]:
df = pd.read_csv('../data/date_fixed_running_data.csv', parse_dates=['Unnamed: 0'])
We can import the date correctly either by specifying the column name, like we did in the last video, or by specifying the 0-based column position.
In [6]:
df = pd.read_csv('../data/date_fixed_running_data.csv', parse_dates=[0])
In [7]:
df.head()
Out[7]:
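As an aside (this isn't something we do in the video), you can avoid the unnamed column entirely by telling read_csv to treat that first column as the index at load time; with index_col set, parse_dates=True parses the index itself:
df = pd.read_csv('../data/date_fixed_running_data.csv', index_col=0, parse_dates=True)
We'll stick with the video's approach below and rename the columns by hand.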
Let’s go ahead and rename the columns and set our index to the dates again.
In [8]:
cols = ['Date', 'Miles', 'Time']
df.columns = cols
df.head()
Out[8]:
We can also rename a single column using the rename method; this is handy when you only want to change one column rather than reassigning all of them.
In [9]:
df.rename(columns={df.columns[0]:'Date'}, inplace=True)
In [10]:
df.head()
Out[10]:
Now we can plot it again once we set the index.
In [11]:
df.set_index('Date', inplace=True)
In [12]:
df.plot()
Out[12]:
Now one thing we’re going to want to do is get a breakdown of times for each run. Stats like minutes per mile and that sort of thing.
To do that we're going to use map. We need to convert our times, which are stored as hours:minutes:seconds, into pure seconds; from there we can derive minutes and so on.
Let's go ahead and write our map function. There is one edge case we need to handle: remember, we don't have time data for every single run, so we've got to handle the NaN values correctly.
We'll do this by just keeping them as NaN.
In [16]:
raw_time_fmt = df.Time[0]

def get_total_seconds(raw_time):
    if raw_time is np.nan:
        return np.nan  # if it's blank, keep it blank
    hrs, mins, seconds = str(raw_time).split(':')
    seconds = int(seconds) + 60 * int(mins) + 60 * 60 * int(hrs)
    return seconds
print(get_total_seconds(raw_time_fmt))
In [17]:
df['Seconds'] = df.Time.map(get_total_seconds)
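As an aside (not what we do in the video), reasonably recent versions of pandas can parse these hour:minute:second strings directly with to_timedelta; missing values come through as NaT and become NaN seconds, so the result should match our Seconds column:
pd.to_timedelta(df['Time']).dt.total_seconds()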
In [18]:
df.head(10)
Out[18]:
Now we can run describe and see that I recorded times for 52 of the runs.
In [19]:
df.describe()
Out[19]:
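If you just want that count of non-missing times on its own, count() skips NaN values:
df['Seconds'].count()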
This is a great example of why np.nan is so useful in pandas. When we run describe, pandas just disregards the missing values, unlike plain NumPy, where a NaN would propagate through the result. If we were to fill them in with zeros, that would drag down our statistics.
We can see that below: look how much the statistics change when the missing times are filled with 0, which would make for incorrect analysis.
In [20]:
df.fillna(0).describe()
Out[20]:
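To make the NumPy comparison concrete, here's a tiny toy example (not a cell from the original notebook):
s = pd.Series([1.0, np.nan, 3.0])
s.mean()            # 2.0 -- pandas skips NaN by default
np.mean(s.values)   # nan -- a plain NumPy mean does not
s.fillna(0).mean()  # ~1.33 -- filling with zeros drags the mean down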
Now that we have seconds it would be useful to see minutes too. I find it hard to think in seconds.
So let’s add a new column.
In [21]:
df['Minutes'] = df['Seconds'].map(lambda x: x / 60)
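Note that map isn't strictly necessary here; division on a Series is vectorized and leaves NaN as NaN, so this one-liner does the same thing:
df['Minutes'] = df['Seconds'] / 60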
In [22]:
df.describe()
Out[22]:
Now that we've run describe again, let's plot it.
In [23]:
df.plot(x='Miles', y='Minutes', kind='scatter')
Out[23]:
Let's plot Miles and Minutes together in a scatter plot. Wow, that's linear. Let's see how correlated they are; we do this with the corr method. We can see that miles and time are very tightly correlated (this uses the Pearson correlation coefficient by default). There are two other correlation methods you can use: Kendall's tau and Spearman rank correlation.
In [24]:
df.corr()
Out[24]:
In [25]:
df.corr(method='kendall')
Out[25]:
In [26]:
df.corr(method='spearman')
Out[26]:
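If you only care about a single pair rather than the whole matrix, Series.corr takes the same method argument:
df['Miles'].corr(df['Minutes'])                     # Pearson by default
df['Miles'].corr(df['Minutes'], method='spearman')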
Now let's see a box plot. Between these two plots we get a much better idea of the data: most of my runs are below an hour, except for a couple that are much longer.
In [27]:
df.boxplot('Minutes', return_type='axes')
Out[27]:
Now let's add minutes per mile; we can just divide our two Series to get those numbers.
In [28]:
df['Minutes'] / df['Miles']
Out[28]:
In [29]:
df['Min_per_mile'] = df['Minutes'] / df['Miles']
In [30]:
df.describe()
Out[30]:
We can see that at shorter distances, my speed varies a lot more.
In [31]:
df.plot(x='Miles', y='Min_per_mile', kind='scatter')
plt.ylabel("Minutes / Mile")
Out[31]:
Let's see a histogram of my speeds. Histograms are a great way of representing frequency data, or how often certain values occur.
In [32]:
df.hist('Min_per_mile')
Out[32]:
Seems pretty centered in that 7 to 7.5 minute range. Let's see if we can get more information with more bins, which we specify with the bins argument.
In [33]:
df.hist('Min_per_mile',bins=20)
Out[33]:
That's interesting: just under 7 and right around 7.5 are the most common. I bet that has something to do with my running distances, or the courses I choose to run.
In [34]:
df.hist('Min_per_mile',bins=20, figsize=(10,8))
plt.xlim((5, 11))
plt.ylim((0, 12))
plt.title("Minutes Per Mile Histogram")
plt.grid(False)
plt.savefig('../assets/minutes_per_mile_histogram.png')
In [35]:
df['Miles']
Out[35]:
Now another cool thing you can do with time series is compute the rolling mean, rolling sum, or even rolling correlations. There are a lot of different "rolling" operations you can do.
In [36]:
df['Miles'].plot()
Out[36]:
So here's a standard plot of our Miles again, just a line over time. To add another line to the same figure, we just call plot again in the same cell. As I was touching on rolling values, let's talk about the rolling average. To compute it, we pass in a Series or a DataFrame.
In [37]:
df['Miles'].plot()
pd.rolling_mean(df['Miles'], 7).plot()
Out[37]:
I can do the same with the rolling standard deviation or sum.
In [38]:
df['Miles'].plot()
pd.rolling_std(df['Miles'], 7).plot()
Out[38]:
In [39]:
df['Miles'].plot()
pd.rolling_sum(df['Miles'], 7).plot()
Out[39]:
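A quick note if you're on a newer version of pandas: the top-level pd.rolling_* helpers were eventually removed in favor of the .rolling() method, which computes the same thing:
df['Miles'].rolling(7).mean().plot()
df['Miles'].rolling(7).std().plot()
df['Miles'].rolling(7).sum().plot()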
Now, as a last note, one thing that's cool about datetime indexes is that you can query them very naturally. If I want to get all my runs in October of 2014, I can just enter that as a string.
In [40]:
df.index
Out[40]:
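For that October query, a partial date string with loc does the trick (this isn't a cell from the original notebook):
df.loc['2014-10']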
If I want everything from November through December, I can do that with a slice.
In [41]:
df['2014-11':'2014-12']
Out[41]:
How do you think we might go from November through January 1, 2015?
Go ahead and give it a try and see if you can figure it out.
In [42]:
df['2014-11':'2015-1-1']['Miles'].plot()
Out[42]:
Now, we can slice a range this way, but grabbing a single specific date's run doesn't work the same way.
In [43]:
df['2014-8-12']
To do that we need to use loc.
In [44]:
df.loc['2014-8-12']
Out[44]:
Now that we've done all this work, we should save it so that we don't have to remember what our operations were or what stage we did them at. We could save it to CSV like we did the other one, but I want to illustrate the different ways you can save this file.
Let's save our CSV, but we can also save it as an HTML page (which will give us a table view) or a JSON file.
In [45]:
df.head()
Out[45]:
In [46]:
df.to_csv('../data/date_fixed_running_data_with_time.csv')
In [47]:
df.to_html('../data/date_fixed_running_data_with_time.html')
One thing to note with JSON files is that they want unique indexes (because the index values become the keys), so we've got to give it a new index. We can do this by resetting our index, or by copying the index into a column and replacing it.
In [48]:
df.to_json('../data/date_fixed_running_data_with_time.json')
In [49]:
df.reset_index()
Out[49]:
In [50]:
df['Date'] = df.index
In [51]:
df.index = range(df.shape[0])
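For what it's worth, reset_index can do both of those steps in one go if you assign the result back (it returns a new frame by default):
df = df.reset_index()  # 'Date' moves back to a regular column, index becomes 0..n-1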
In [52]:
df.head()
df.to_json('../data/date_fixed_running_data_with_time.json')
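Another option worth knowing about, though it's not what we use here: to_json takes an orient argument, and orient='records' writes one JSON object per row and leaves the index out entirely:
df.to_json('../data/date_fixed_running_data_with_time.json', orient='records')  # just for illustration; overwrites the same file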
Now there's a LOT more you can do with datetime indexing, but this is about all I wanted to cover in this video. We will get into more specifics later. By now you should be getting a lot more familiar with pandas and the IPython + pandas workflow.
In [53]:
df.Date[0]
Out[53]: